InΒ [Β ]:
---
title: "Module 04: Lab 01"
subtitle: "Visual Reporting and Storytelling"
author: "Xinran Li"
number-sections: true
date: "2025-03-19"
date-modified: today
date-format: long
engine: jupyter
categories: ["visualization","plotly","spark","Visual Reporting","Storytelling with Data","Industry-Specific Visualization",]
execute: 
  eval: true
  echo: true
---

Step 1: Load the Dataset {.unnumbered}ΒΆ

InΒ [1]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

pio.renderers.default = "notebook"

# Initialize Spark Session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = spark.read.option("header", "true") \
               .option("inferSchema", "true") \
               .option("multiLine", "true") \
               .option("escape", "\"") \
               .csv("lightcast_data.csv")

# Show Schema and Sample Data
df.printSchema()
df.show(5)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/20 06:17:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                
root
 |-- ID: string (nullable = true)
 |-- LAST_UPDATED_DATE: string (nullable = true)
 |-- LAST_UPDATED_TIMESTAMP: timestamp (nullable = true)
 |-- DUPLICATES: integer (nullable = true)
 |-- POSTED: string (nullable = true)
 |-- EXPIRED: string (nullable = true)
 |-- DURATION: integer (nullable = true)
 |-- SOURCE_TYPES: string (nullable = true)
 |-- SOURCES: string (nullable = true)
 |-- URL: string (nullable = true)
 |-- ACTIVE_URLS: string (nullable = true)
 |-- ACTIVE_SOURCES_INFO: string (nullable = true)
 |-- TITLE_RAW: string (nullable = true)
 |-- BODY: string (nullable = true)
 |-- MODELED_EXPIRED: string (nullable = true)
 |-- MODELED_DURATION: integer (nullable = true)
 |-- COMPANY: integer (nullable = true)
 |-- COMPANY_NAME: string (nullable = true)
 |-- COMPANY_RAW: string (nullable = true)
 |-- COMPANY_IS_STAFFING: boolean (nullable = true)
 |-- EDUCATION_LEVELS: string (nullable = true)
 |-- EDUCATION_LEVELS_NAME: string (nullable = true)
 |-- MIN_EDULEVELS: integer (nullable = true)
 |-- MIN_EDULEVELS_NAME: string (nullable = true)
 |-- MAX_EDULEVELS: integer (nullable = true)
 |-- MAX_EDULEVELS_NAME: string (nullable = true)
 |-- EMPLOYMENT_TYPE: integer (nullable = true)
 |-- EMPLOYMENT_TYPE_NAME: string (nullable = true)
 |-- MIN_YEARS_EXPERIENCE: integer (nullable = true)
 |-- MAX_YEARS_EXPERIENCE: integer (nullable = true)
 |-- IS_INTERNSHIP: boolean (nullable = true)
 |-- SALARY: integer (nullable = true)
 |-- REMOTE_TYPE: integer (nullable = true)
 |-- REMOTE_TYPE_NAME: string (nullable = true)
 |-- ORIGINAL_PAY_PERIOD: string (nullable = true)
 |-- SALARY_TO: integer (nullable = true)
 |-- SALARY_FROM: integer (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- CITY_NAME: string (nullable = true)
 |-- COUNTY: integer (nullable = true)
 |-- COUNTY_NAME: string (nullable = true)
 |-- MSA: integer (nullable = true)
 |-- MSA_NAME: string (nullable = true)
 |-- STATE: integer (nullable = true)
 |-- STATE_NAME: string (nullable = true)
 |-- COUNTY_OUTGOING: integer (nullable = true)
 |-- COUNTY_NAME_OUTGOING: string (nullable = true)
 |-- COUNTY_INCOMING: integer (nullable = true)
 |-- COUNTY_NAME_INCOMING: string (nullable = true)
 |-- MSA_OUTGOING: integer (nullable = true)
 |-- MSA_NAME_OUTGOING: string (nullable = true)
 |-- MSA_INCOMING: integer (nullable = true)
 |-- MSA_NAME_INCOMING: string (nullable = true)
 |-- NAICS2: integer (nullable = true)
 |-- NAICS2_NAME: string (nullable = true)
 |-- NAICS3: integer (nullable = true)
 |-- NAICS3_NAME: string (nullable = true)
 |-- NAICS4: integer (nullable = true)
 |-- NAICS4_NAME: string (nullable = true)
 |-- NAICS5: integer (nullable = true)
 |-- NAICS5_NAME: string (nullable = true)
 |-- NAICS6: integer (nullable = true)
 |-- NAICS6_NAME: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- TITLE_NAME: string (nullable = true)
 |-- TITLE_CLEAN: string (nullable = true)
 |-- SKILLS: string (nullable = true)
 |-- SKILLS_NAME: string (nullable = true)
 |-- SPECIALIZED_SKILLS: string (nullable = true)
 |-- SPECIALIZED_SKILLS_NAME: string (nullable = true)
 |-- CERTIFICATIONS: string (nullable = true)
 |-- CERTIFICATIONS_NAME: string (nullable = true)
 |-- COMMON_SKILLS: string (nullable = true)
 |-- COMMON_SKILLS_NAME: string (nullable = true)
 |-- SOFTWARE_SKILLS: string (nullable = true)
 |-- SOFTWARE_SKILLS_NAME: string (nullable = true)
 |-- ONET: string (nullable = true)
 |-- ONET_NAME: string (nullable = true)
 |-- ONET_2019: string (nullable = true)
 |-- ONET_2019_NAME: string (nullable = true)
 |-- CIP6: string (nullable = true)
 |-- CIP6_NAME: string (nullable = true)
 |-- CIP4: string (nullable = true)
 |-- CIP4_NAME: string (nullable = true)
 |-- CIP2: string (nullable = true)
 |-- CIP2_NAME: string (nullable = true)
 |-- SOC_2021_2: string (nullable = true)
 |-- SOC_2021_2_NAME: string (nullable = true)
 |-- SOC_2021_3: string (nullable = true)
 |-- SOC_2021_3_NAME: string (nullable = true)
 |-- SOC_2021_4: string (nullable = true)
 |-- SOC_2021_4_NAME: string (nullable = true)
 |-- SOC_2021_5: string (nullable = true)
 |-- SOC_2021_5_NAME: string (nullable = true)
 |-- LOT_CAREER_AREA: integer (nullable = true)
 |-- LOT_CAREER_AREA_NAME: string (nullable = true)
 |-- LOT_OCCUPATION: integer (nullable = true)
 |-- LOT_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_CAREER_AREA: integer (nullable = true)
 |-- LOT_V6_CAREER_AREA_NAME: string (nullable = true)
 |-- SOC_2: string (nullable = true)
 |-- SOC_2_NAME: string (nullable = true)
 |-- SOC_3: string (nullable = true)
 |-- SOC_3_NAME: string (nullable = true)
 |-- SOC_4: string (nullable = true)
 |-- SOC_4_NAME: string (nullable = true)
 |-- SOC_5: string (nullable = true)
 |-- SOC_5_NAME: string (nullable = true)
 |-- LIGHTCAST_SECTORS: string (nullable = true)
 |-- LIGHTCAST_SECTORS_NAME: string (nullable = true)
 |-- NAICS_2022_2: integer (nullable = true)
 |-- NAICS_2022_2_NAME: string (nullable = true)
 |-- NAICS_2022_3: integer (nullable = true)
 |-- NAICS_2022_3_NAME: string (nullable = true)
 |-- NAICS_2022_4: integer (nullable = true)
 |-- NAICS_2022_4_NAME: string (nullable = true)
 |-- NAICS_2022_5: integer (nullable = true)
 |-- NAICS_2022_5_NAME: string (nullable = true)
 |-- NAICS_2022_6: integer (nullable = true)
 |-- NAICS_2022_6_NAME: string (nullable = true)

25/03/20 06:18:11 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|                  ID|LAST_UPDATED_DATE|LAST_UPDATED_TIMESTAMP|DUPLICATES|  POSTED|  EXPIRED|DURATION|        SOURCE_TYPES|             SOURCES|                 URL|ACTIVE_URLS|ACTIVE_SOURCES_INFO|           TITLE_RAW|                BODY|MODELED_EXPIRED|MODELED_DURATION| COMPANY|        COMPANY_NAME|COMPANY_RAW|COMPANY_IS_STAFFING|EDUCATION_LEVELS|EDUCATION_LEVELS_NAME|MIN_EDULEVELS| MIN_EDULEVELS_NAME|MAX_EDULEVELS|MAX_EDULEVELS_NAME|EMPLOYMENT_TYPE|EMPLOYMENT_TYPE_NAME|MIN_YEARS_EXPERIENCE|MAX_YEARS_EXPERIENCE|IS_INTERNSHIP|SALARY|REMOTE_TYPE|REMOTE_TYPE_NAME|ORIGINAL_PAY_PERIOD|SALARY_TO|SALARY_FROM|            LOCATION|                CITY|    CITY_NAME|COUNTY|   COUNTY_NAME|  MSA|            MSA_NAME|STATE|STATE_NAME|COUNTY_OUTGOING|COUNTY_NAME_OUTGOING|COUNTY_INCOMING|COUNTY_NAME_INCOMING|MSA_OUTGOING|   MSA_NAME_OUTGOING|MSA_INCOMING|   MSA_NAME_INCOMING|NAICS2|         NAICS2_NAME|NAICS3|         NAICS3_NAME|NAICS4|         NAICS4_NAME|NAICS5|         NAICS5_NAME|NAICS6|         NAICS6_NAME|             TITLE|         TITLE_NAME|         TITLE_CLEAN|              SKILLS|         SKILLS_NAME|  SPECIALIZED_SKILLS|SPECIALIZED_SKILLS_NAME|      CERTIFICATIONS| CERTIFICATIONS_NAME|       COMMON_SKILLS|  COMMON_SKILLS_NAME|     SOFTWARE_SKILLS|SOFTWARE_SKILLS_NAME|      ONET|           ONET_NAME| ONET_2019|      ONET_2019_NAME|                CIP6|           CIP6_NAME|                CIP4|           CIP4_NAME|                CIP2|           CIP2_NAME|SOC_2021_2|     SOC_2021_2_NAME|SOC_2021_3|     SOC_2021_3_NAME|SOC_2021_4|SOC_2021_4_NAME|SOC_2021_5|SOC_2021_5_NAME|LOT_CAREER_AREA|LOT_CAREER_AREA_NAME|LOT_OCCUPATION| LOT_OCCUPATION_NAME|LOT_SPECIALIZED_OCCUPATION|LOT_SPECIALIZED_OCCUPATION_NAME|LOT_OCCUPATION_GROUP|LOT_OCCUPATION_GROUP_NAME|LOT_V6_SPECIALIZED_OCCUPATION|LOT_V6_SPECIALIZED_OCCUPATION_NAME|LOT_V6_OCCUPATION|LOT_V6_OCCUPATION_NAME|LOT_V6_OCCUPATION_GROUP|LOT_V6_OCCUPATION_GROUP_NAME|LOT_V6_CAREER_AREA|LOT_V6_CAREER_AREA_NAME|  SOC_2|          SOC_2_NAME|  SOC_3|          SOC_3_NAME|  SOC_4|     SOC_4_NAME|  SOC_5|     SOC_5_NAME|LIGHTCAST_SECTORS|LIGHTCAST_SECTORS_NAME|NAICS_2022_2|   NAICS_2022_2_NAME|NAICS_2022_3|   NAICS_2022_3_NAME|NAICS_2022_4|   NAICS_2022_4_NAME|NAICS_2022_5|   NAICS_2022_5_NAME|NAICS_2022_6|   NAICS_2022_6_NAME|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|1f57d95acf4dc67ed...|         9/6/2024|  2024-09-06 20:32:...|         0|6/2/2024| 6/8/2024|       6|   [\n  "Company"\n]|[\n  "brassring.c...|[\n  "https://sjo...|         []|               NULL|Enterprise Analys...|31-May-2024\n\nEn...|       6/8/2024|               6|  894731|          Murphy USA| Murphy USA|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   2|                   2|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.20...|RWwgRG9yYWRvLCBBUg==|El Dorado, AR|  5139|     Union, AR|20980|       El Dorado, AR|    5|  Arkansas|           5139|           Union, AR|           5139|           Union, AR|       20980|       El Dorado, AR|       20980|       El Dorado, AR|    44|        Retail Trade|   441|Motor Vehicle and...|  4413|Automotive Parts,...| 44133|Automotive Parts ...|441330|Automotive Parts ...|ET29C073C03D1F86B4|Enterprise Analysts|enterprise analys...|[\n  "KS126DB6T06...|[\n  "Merchandisi...|[\n  "KS126DB6T06...|   [\n  "Merchandisi...|                  []|                  []|[\n  "KS126706DPF...|[\n  "Mathematics...|[\n  "KS440W865GC...|[\n  "SQL (Progra...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|[\n  "45.0601",\n...|[\n  "Economics, ...|[\n  "45.06",\n  ...|[\n  "Economics",...|[\n  "45",\n  "27...|[\n  "Social Scie...|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101011|           General ERP Analy...|                2310|     Business Intellig...|                     23101011|              General ERP Analy...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  7\n]|  [\n  "Artificial ...|          44|        Retail Trade|         441|Motor Vehicle and...|        4413|Automotive Parts,...|       44133|Automotive Parts ...|      441330|Automotive Parts ...|
|0cb072af26757b6c4...|         8/2/2024|  2024-08-02 17:08:...|         0|6/2/2024| 8/1/2024|    NULL| [\n  "Job Board"\n]| [\n  "maine.gov"\n]|[\n  "https://job...|         []|               NULL|Oracle Consultant...|Oracle Consultant...|       8/1/2024|            NULL|  133098|Smx Corporation L...|        SMX|               true|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                   3|        false|  NULL|          1|          Remote|               NULL|     NULL|       NULL|{\n  "lat": 44.31...|    QXVndXN0YSwgTUU=|  Augusta, ME| 23011|  Kennebec, ME|12300|Augusta-Watervill...|   23|     Maine|          23011|        Kennebec, ME|          23011|        Kennebec, ME|       12300|Augusta-Watervill...|       12300|Augusta-Watervill...|    56|Administrative an...|   561|Administrative an...|  5613| Employment Services| 56132|Temporary Help Se...|561320|Temporary Help Se...|ET21DDA63780A7DC09| Oracle Consultants|oracle consultant...|[\n  "KS122626T55...|[\n  "Procurement...|[\n  "KS122626T55...|   [\n  "Procurement...|                  []|                  []|                  []|                  []|[\n  "BGSBF3F508F...|[\n  "Oracle Busi...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          56|Administrative an...|         561|Administrative an...|        5613| Employment Services|       56132|Temporary Help Se...|      561320|Temporary Help Se...|
|85318b12b3331fa49...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024| 7/7/2024|      35| [\n  "Job Board"\n]|[\n  "dejobs.org"\n]|[\n  "https://dej...|         []|               NULL|        Data Analyst|Taking care of pe...|      6/10/2024|               8|39063746|            Sedgwick|   Sedgwick|              false|       [\n  2\n]| [\n  "Bachelor's ...|            2|  Bachelor's degree|         NULL|              NULL|              1|Full-time (> 32 h...|                   5|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 32.77...|    RGFsbGFzLCBUWA==|   Dallas, TX| 48113|    Dallas, TX|19100|Dallas-Fort Worth...|   48|     Texas|          48113|          Dallas, TX|          48113|          Dallas, TX|       19100|Dallas-Fort Worth...|       19100|Dallas-Fort Worth...|    52|Finance and Insur...|   524|Insurance Carrier...|  5242|Agencies, Brokera...| 52429|Other Insurance R...|524291|    Claims Adjusting|ET3037E0C947A02404|      Data Analysts|        data analyst|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "ESF3939CE1F...|   [\n  "Exception R...|[\n  "KS683TN76T7...|[\n  "Security Cl...|[\n  "KS1218W78FG...|[\n  "Management"...|[\n  "KS126HY6YLT...|[\n  "Microsoft O...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          52|Finance and Insur...|         524|Insurance Carrier...|        5242|Agencies, Brokera...|       52429|Other Insurance R...|      524291|    Claims Adjusting|
|1b5c3941e54a1889e...|         9/6/2024|  2024-09-06 20:32:...|         1|6/2/2024|7/20/2024|      48| [\n  "Job Board"\n]|[\n  "disabledper...|[\n  "https://www...|         []|               NULL|Sr. Lead Data Mgm...|About this role:\...|      6/12/2024|              10|37615159|         Wells Fargo|Wells Fargo|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              1|Full-time (> 32 h...|                   3|                NULL|        false|  NULL|          0|          [None]|               NULL|     NULL|       NULL|{\n  "lat": 33.44...|    UGhvZW5peCwgQVo=|  Phoenix, AZ|  4013|  Maricopa, AZ|38060|Phoenix-Mesa-Chan...|    4|   Arizona|           4013|        Maricopa, AZ|           4013|        Maricopa, AZ|       38060|Phoenix-Mesa-Chan...|       38060|Phoenix-Mesa-Chan...|    52|Finance and Insur...|   522|Credit Intermedia...|  5221|Depository Credit...| 52211|  Commercial Banking|522110|  Commercial Banking|ET2114E0404BA30075|Management Analysts|sr lead data mgmt...|[\n  "KS123QX62QY...|[\n  "Exit Strate...|[\n  "KS123QX62QY...|   [\n  "Exit Strate...|                  []|                  []|[\n  "KS7G6NP6R6L...|[\n  "Reliability...|[\n  "KS4409D76NW...|[\n  "SAS (Softwa...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231113|Data / Data Minin...|                  23111310|                   Data Analyst|                2311|     Data Analysis and...|                     23111310|                      Data Analyst|           231113|  Data / Data Minin...|                   2311|        Data Analysis and...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|        [\n  6\n]|  [\n  "Data Privac...|          52|Finance and Insur...|         522|Credit Intermedia...|        5221|Depository Credit...|       52211|  Commercial Banking|      522110|  Commercial Banking|
|cb5ca25f02bdf25c1...|        6/19/2024|   2024-06-19 07:00:00|         0|6/2/2024|6/17/2024|      15|[\n  "FreeJobBoar...|[\n  "craigslist....|[\n  "https://mod...|         []|               NULL|Comisiones de $10...|Comisiones de $10...|      6/17/2024|              15|       0|        Unclassified|      LH/GM|              false|      [\n  99\n]| [\n  "No Educatio...|           99|No Education Listed|         NULL|              NULL|              3|Part-time / full-...|                NULL|                NULL|        false| 92500|          0|          [None]|               year|   150000|      35000|{\n  "lat": 37.63...|    TW9kZXN0bywgQ0E=|  Modesto, CA|  6099|Stanislaus, CA|33700|         Modesto, CA|    6|California|           6099|      Stanislaus, CA|           6099|      Stanislaus, CA|       33700|         Modesto, CA|       33700|         Modesto, CA|    99|Unclassified Indu...|   999|Unclassified Indu...|  9999|Unclassified Indu...| 99999|Unclassified Indu...|999999|Unclassified Indu...|ET0000000000000000|       Unclassified|comisiones de por...|                  []|                  []|                  []|                     []|                  []|                  []|                  []|                  []|                  []|                  []|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|                  []|                  []|                  []|                  []|                  []|                  []|   15-0000|Computer and Math...|   15-2000|Mathematical Scie...|   15-2050|Data Scientists|   15-2051|Data Scientists|             23|Information Techn...|        231010|Business Intellig...|                  23101012|           Oracle Consultant...|                2310|     Business Intellig...|                     23101012|              Oracle Consultant...|           231010|  Business Intellig...|                   2310|        Business Intellig...|                23|   Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists|             NULL|                  NULL|          99|Unclassified Indu...|         999|Unclassified Indu...|        9999|Unclassified Indu...|       99999|Unclassified Indu...|      999999|Unclassified Indu...|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+--------------------+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
only showing top 5 rows

Salary Distribution by Employment TypeΒΆ

  • Identify salary trends across different employment types.
  • Filter the dataset
    • Remove records where salary is missing or zero.
  • Aggregate Data
    • Group by employment type and compute salary distribution.
  • Visualize results
    • Create a box plot where:
      • X-axis = EMPLOYMENT_TYPE_NAME
      • Y-axis = SALARY_FROM
    • Customize colors, fonts, and styles to avoid a 2.5-point deduction.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [2]:
df_clean = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") > 0))
print(f"Rows before cleaning: {df.count()}")
print(f"Rows after cleaning: {df_clean.count()}")
df_clean.select("EMPLOYMENT_TYPE_NAME", "SALARY_FROM").show(5)
pdf = df_clean.select("EMPLOYMENT_TYPE_NAME", "SALARY_FROM").toPandas()
print(pdf.head())
fig = px.box(
    pdf,
    x="EMPLOYMENT_TYPE_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Employment Type",
    color_discrete_sequence=["#636EFA"]
)
fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="Employment Type",
    yaxis_title="Salary (USD)",
    template="plotly_white"
)
fig.show()
                                                                                
Rows before cleaning: 72498
                                                                                
Rows after cleaning: 32398
+--------------------+-----------+
|EMPLOYMENT_TYPE_NAME|SALARY_FROM|
+--------------------+-----------+
|Part-time / full-...|      35000|
|Full-time (> 32 h...|      94420|
|Full-time (> 32 h...|      79500|
|Full-time (> 32 h...|      91559|
|Full-time (> 32 h...|     144600|
+--------------------+-----------+
only showing top 5 rows

                                                                                
     EMPLOYMENT_TYPE_NAME  SALARY_FROM
0   Part-time / full-time        35000
1  Full-time (> 32 hours)        94420
2  Full-time (> 32 hours)        79500
3  Full-time (> 32 hours)        91559
4  Full-time (> 32 hours)       144600

The box plot reveals significant salary differences across various employment types. Full-time jobs (more than 32 hours) generally have a higher median salary and a wider distribution. Some extreme outliers suggest exceptionally high-paying positions, likely due to specific industries or executive roles. In contrast, part-time jobs tend to have a lower median salary with a more concentrated distribution, indicating relatively stable pay levels.ΒΆ

Additionally, the salary range varies significantly across employment types. Full-time roles exhibit a broader salary spectrum, whereas part-time positions show a narrower spread. This variation may result from industry-specific compensation structures. Furthermore, there is an encoding issue in some employment type labels (e.g., "Part-time (Ò‰Β₯ 32 hours)"), likely caused by character encoding inconsistencies. Addressing this formatting issue in the dataset could enhance visualization accuracy and readability. Further analysis could refine the classification and explore the underlying factors contributing to salary discrepancies.ΒΆ

Salary Distribution by IndustryΒΆ

  • Compare salary variations across industries.
  • Filter the dataset
    • Keep records where salary is greater than zero.
  • Aggregate Data
    • Group by NAICS industry codes.
  • Visualize results
    • Create a box plot where:
      • X-axis = NAICS2_NAME
      • Y-axis = SALARY_FROM
    • Customize colors, fonts, and styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [3]:
df_clean = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") > 0))
df_grouped = df_clean.select("NAICS2_NAME", "SALARY_FROM")

print(f"Rows after cleaning: {df_clean.count()}")
df_grouped.show(5)
pdf = df_grouped.toPandas()
print(pdf.head())
fig = px.box(
    pdf,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Industry",
    color_discrete_sequence=["#636EFA"]
)

fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="Industry (NAICS2)",
    yaxis_title="Salary (USD)",
    template="plotly_white"
)

fig.show()
                                                                                
Rows after cleaning: 32398
+--------------------+-----------+
|         NAICS2_NAME|SALARY_FROM|
+--------------------+-----------+
|Unclassified Indu...|      35000|
|         Information|      94420|
|Professional, Sci...|      79500|
|     Wholesale Trade|      91559|
|Professional, Sci...|     144600|
+--------------------+-----------+
only showing top 5 rows

                                                                                
                                        NAICS2_NAME  SALARY_FROM
0                             Unclassified Industry        35000
1                                       Information        94420
2  Professional, Scientific, and Technical Services        79500
3                                   Wholesale Trade        91559
4  Professional, Scientific, and Technical Services       144600

The box plot reveals significant salary variations across different industries. Industries such as Information Technology, Finance and Insurance, and Professional, Scientific, and Technical Services have higher median salaries and a broader salary distribution. This suggests that these industries include a wide range of positions, from entry-level roles to senior executive positions, resulting in substantial salary gaps. Additionally, the presence of high-salary outliers indicates the demand for specialized professionals and executive roles in these fields.ΒΆ

In contrast, industries such as Retail Trade, Educational Services, and Accommodation and Food Services have lower median salaries with a more concentrated distribution. This indicates relatively stable wage structures with less fluctuation. The Unclassified Industry category also contains lower salary ranges, possibly due to incomplete data classification or diverse salary structures within unspecified sectors. Further analysis could focus on filtering out outliers or investigating specific job roles that contribute to salary disparities within each industry.ΒΆ

Job Posting Trends Over TimeΒΆ

  • Analyze how job postings fluctuate over time.
  • Aggregate Data
    • Count job postings per posted date (POSTED).
  • Visualize results
    • Create a line chart where:
      • X-axis = POSTED
      • Y-axis = Number of Job Postings
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [4]:
from pyspark.sql.functions import col, to_date, count
df.select("POSTED").distinct().show(10, False)
df_clean = df.withColumn("POSTED", to_date(col("POSTED"), "M/d/yyyy"))
df_clean.select("POSTED").distinct().show(10, False)
df_grouped = df_clean.groupBy("POSTED").agg(count("*").alias("Job_Count")).orderBy("POSTED")
print(f"Rows after grouping: {df_grouped.count()}")
df_grouped.show(10)
pdf = df_grouped.toPandas()
fig = px.line(
    pdf,
    x="POSTED",
    y="Job_Count",
    title="Job Posting Trends Over Time",
    line_shape="linear"
)

fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="Date Posted",
    yaxis_title="Number of Job Postings",
    template="plotly_white"
)
fig.show()
                                                                                
+---------+
|POSTED   |
+---------+
|8/6/2024 |
|5/8/2024 |
|9/19/2024|
|6/24/2024|
|9/12/2024|
|5/28/2024|
|7/16/2024|
|8/20/2024|
|6/15/2024|
|8/28/2024|
+---------+
only showing top 10 rows

                                                                                
+----------+
|POSTED    |
+----------+
|2024-09-18|
|2024-05-30|
|2024-06-12|
|2024-08-27|
|2024-06-04|
|2024-05-25|
|2024-08-30|
|2024-07-08|
|2024-09-10|
|2024-08-05|
+----------+
only showing top 10 rows

                                                                                
Rows after grouping: 154
                                                                                
+----------+---------+
|    POSTED|Job_Count|
+----------+---------+
|      NULL|       22|
|2024-05-01|      506|
|2024-05-02|      437|
|2024-05-03|      679|
|2024-05-04|      573|
|2024-05-05|      159|
|2024-05-06|      169|
|2024-05-07|      516|
|2024-05-08|      471|
|2024-05-09|      619|
+----------+---------+
only showing top 10 rows

                                                                                

As can be seen from the line chart, the number of job posts shows an obvious fluctuation trend in the time dimension. The data covers the period from May to September 2024, during which the number of job postings per day fluctuated between 200-1,000, showing a dynamic shift in hiring demand. A spike in job postings at certain times may be related to a company's peak hiring season or increased demand for talent in a specific industry, such as the beginning of the month or the end of the quarter, which may be the peak of hiring activity. In contrast, the number of job openings decreases during certain periods of time, which may be related to weekends, holidays, or slow seasons for industry hiring.ΒΆ

In addition, some of the POSTED data is NULL, but most of the data has been successfully converted to standard date format, ensuring the accuracy of trend analysis. For further optimization, the data can be smoothed, such as calculating a 7-day rolling average, which reduces the impact of daily fluctuations and makes long-term trends more visible.ΒΆ

Top 10 Job Titles by CountΒΆ

  • Identify the most frequently posted job titles.
  • Aggregate Data
    • Count the occurrences of each job title (TITLE_NAME).
    • Select the top 10 most frequent titles.
  • Visualize results
    • Create a bar chart where:
      • X-axis = TITLE_NAME
      • Y-axis = Job Count
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [5]:
from pyspark.sql.functions import col, count
df_grouped = df.groupBy("TITLE_NAME").agg(count("*").alias("Job_Count"))
df_top10 = df_grouped.orderBy(col("Job_Count").desc()).limit(10)
df_top10.show(10, False)
pdf = df_top10.toPandas()
print(pdf.head())
fig = px.bar(
    pdf,
    x="TITLE_NAME",
    y="Job_Count",
    title="Top 10 Job Titles by Count",
    text="Job_Count",
    color_discrete_sequence=["#636EFA"]
)
fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="Job Title",
    yaxis_title="Job Count",
    template="plotly_white"
)
fig.show()
                                                                                
+------------------------------+---------+
|TITLE_NAME                    |Job_Count|
+------------------------------+---------+
|Data Analysts                 |8591     |
|Unclassified                  |3149     |
|Business Intelligence Analysts|2072     |
|Enterprise Architects         |1999     |
|Oracle Cloud HCM Consultants  |1042     |
|Data Modelers                 |668      |
|Data Governance Analysts      |628      |
|Data Analytics Engineers      |537      |
|ERP Business Analysts         |488      |
|Data Quality Analysts         |467      |
+------------------------------+---------+

[Stage 42:>                                                         (0 + 1) / 1]
                       TITLE_NAME  Job_Count
0                   Data Analysts       8591
1                    Unclassified       3149
2  Business Intelligence Analysts       2072
3           Enterprise Architects       1999
4    Oracle Cloud HCM Consultants       1042
                                                                                

The bar chart reveals that Data Analysts is the most frequently posted job title, with 8,591 job postings, significantly outnumbering other roles. This indicates a high demand for data analytics skills across industries, especially as data-driven decision-making becomes increasingly critical. Additionally, Business Intelligence Analysts and Enterprise Architects are among the top positions, reflecting the growing need for data management, system architecture, and business intelligence expertise.ΒΆ

Notably, Unclassified ranks second, suggesting that some job postings lack standardized classification, which may affect data accuracy. If we exclude unclassified jobs, technical roles such as Data Modelers and Data Governance Analysts still dominate the list, highlighting the market’s emphasis on data management and quality control. In the future, further refining job classifications could provide deeper insights into industry-specific demands for these roles.ΒΆ

Remote vs On-Site Job PostingsΒΆ

  • Compare the proportion of remote and on-site job postings.
  • Aggregate Data
    • Count job postings by remote type (REMOTE_TYPE_NAME).
  • Visualize results
    • Create a pie chart where:
      • Labels = REMOTE_TYPE_NAME
      • Values = Job Count
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [6]:
df_grouped = df.groupBy("REMOTE_TYPE_NAME").agg(count("*").alias("Job_Count"))
df_grouped.show(10, False)
pdf = df_grouped.toPandas()
print(pdf.head())

fig = px.pie(
    pdf,
    names="REMOTE_TYPE_NAME",
    values="Job_Count",
    title="Remote vs On-Site Job Postings",
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    template="plotly_white"
)
fig.show()
                                                                                
+----------------+---------+
|REMOTE_TYPE_NAME|Job_Count|
+----------------+---------+
|Remote          |12497    |
|[None]          |56570    |
|NULL            |44       |
|Not Remote      |1127     |
|Hybrid Remote   |2260     |
+----------------+---------+

[Stage 48:>                                                         (0 + 1) / 1]
  REMOTE_TYPE_NAME  Job_Count
0           Remote      12497
1           [None]      56570
2             None         44
3       Not Remote       1127
4    Hybrid Remote       2260
                                                                                

The pie chart shows that the majority of job postings are labeled as [None], accounting for 78%, which likely means these positions do not explicitly specify whether they are remote or on-site. As a result, they are likely predominantly on-site jobs. Remote positions make up 17.2%, indicating that remote work has become a notable trend, though it remains significantly lower than unspecified positions.ΒΆ

Additionally, Hybrid Remote positions account for 3.12%, suggesting that some companies have adopted a hybrid work model where employees alternate between working from home and the office. Fully on-site jobs (Not Remote) constitute only 1.55%, possibly overlapping with the [None] category. NULL values make up 0.06%, which may be due to missing data. Overall, while remote work is growing in popularity, traditional on-site work still dominates the job market, though the rise of remote and hybrid work models is worth noting.ΒΆ

Skill Demand Analysis by Industry (Stacked Bar Chart)ΒΆ

  • Identify which skills are most in demand in various industries.
  • Aggregate Data
    • Extract skills from job postings.
    • Count occurrences of skills grouped by NAICS industry codes.
  • Visualize results
    • Create a stacked bar chart where:
      • X-axis = Industry
      • Y-axis = Skill Count
      • Color = Skill
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [7]:
from pyspark.sql.functions import col, count, explode, split, trim
df_exploded = df.withColumn("SKILL", explode(split(col("SKILLS_NAME"), ",|, ")))
df_exploded = df_exploded.filter(col("SKILL").isNotNull() & (col("SKILL") != ""))
df_exploded = df_exploded.withColumn("SKILL", trim(col("SKILL")))
df_grouped = df_exploded.groupBy("NAICS2_NAME", "SKILL").agg(count("*").alias("Skill_Count"))
top_skills = df_grouped.groupBy("SKILL").sum("Skill_Count").orderBy(col("sum(Skill_Count)").desc()).limit(10)
df_filtered = df_grouped.join(top_skills, "SKILL", "inner")
df_filtered = df_filtered.filter(col("NAICS2_NAME").isNotNull())
pdf = df_filtered.toPandas()
import plotly.express as px
fig = px.bar(
    pdf,
    x="NAICS2_NAME",
    y="Skill_Count",
    color="SKILL",
    title="Skill Demand Analysis by Industry",
    text="Skill_Count",
    barmode="stack",
    color_discrete_sequence=px.colors.qualitative.Set3
)
fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="Industry (NAICS2)",
    yaxis_title="Skill Count",
    template="plotly_white",
    xaxis_tickangle=-30, 
    width=1600,
    height=800,
    legend=dict(
        orientation="v",
        yanchor="top",
        y=1,
        xanchor="left",
        x=1.02
    )
)
fig.show()
                                                                                

The chart illustrates the demand for different skills across various industries. The stacked bar chart categorizes industries, with the height of each bar representing the total skill demand in that industry, while the different colored sections indicate the proportion of each skill. It is evident that the Professional, Scientific, and Technical Services industry has the highest overall skill demand, significantly surpassing other industries. Additionally, the Finance and Insurance, Information, and Transportation and Warehousing industries also show a strong demand for specific skills.ΒΆ

The distribution of skills varies significantly across industries. For instance, "Project Management", "Communication", and "SQL (Programming Language)" are in high demand across multiple industries, whereas some specialized skills, such as "Business Process" or "Data Analysis", are more prominent in specific sectors. Overall, this chart provides insights into skill demand trends across industries, offering valuable market intelligence for both job seekers and businesses.ΒΆ

Salary Analysis by ONET Occupation Type (Bubble Chart)ΒΆ

  • Analyze how salaries differ across ONET occupation types.
  • Aggregate Data
    • Compute median salary for each occupation in the ONET taxonomy.
  • Visualize results
    • Create a bubble chart where:
      • X-axis = ONET_NAME
      • Y-axis = Median Salary
      • Size = Number of job postings
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [8]:
from pyspark.sql.functions import col, count, expr, percentile_approx
df_grouped = df.groupBy("ONET_NAME").agg(
    percentile_approx("SALARY_FROM", 0.5).alias("Median_Salary"),  # Median salary
    count("*").alias("Job_Count")  # Number of job postings
)
pdf = df_grouped.toPandas()
import plotly.express as px

fig = px.scatter(
    pdf,
    x="ONET_NAME",
    y="Median_Salary",
    size="Job_Count",
    title="Salary Analysis by ONET Occupation Type",
    color="ONET_NAME",
    size_max=50
)
fig.update_layout(
    font_family="Arial",
    title_font_size=16,
    xaxis_title="ONET Occupation",
    yaxis_title="Median Salary (USD)",
    template="plotly_white",
    xaxis_tickangle=-45  
)

fig.show()
                                                                                

From the data analysis results, the dataset contains only one unique occupation category in the ONET_NAME field, which is "Business Intelligence Analysts", with a total of 72,454 records. Additionally, after checking for NULL values in ONET_NAME, it was confirmed that there are no NULL entries, meaning there are no other occupation types in the dataset. Since the entire dataset consists of only this one occupation, the visualization analysis (such as the bubble chart) results in only a single bubble being displayed, rather than a distribution of multiple occupations’ salaries. This suggests that the data might have been filtered during import or cleaning, or the original dataset itself only included this profession. It is recommended to further examine the data source to confirm whether multiple occupation categories should be included and reload the data accordingly. If the goal is to compare salaries across different occupations, ensuring a dataset with diverse ONET_NAME values is necessary.ΒΆ

Career Pathway Trends (Sankey Diagram)ΒΆ

  • Visualize job transitions between different occupation levels.
  • Aggregate Data
    • Identify career transitions between SOC job classifications.
  • Visualize results
    • Create a Sankey diagram where:
      • Source = SOC_2021_2_NAME
      • Target = SOC_2021_3_NAME
      • Value = Number of transitions
    • Apply custom colors and font styles.
  • Explanation: Write two sentences about what the graph reveals.
InΒ [9]:
from pyspark.sql.functions import col, count
import pandas as pd
import plotly.graph_objects as go

df_filtered = df.filter(col("SOC_2021_2_NAME").isNotNull() & col("SOC_2021_3_NAME").isNotNull())
df_grouped = df_filtered.groupBy("SOC_2021_2_NAME", "SOC_2021_3_NAME").agg(count("*").alias("count"))

pdf = df_grouped.toPandas()
labels = list(set(pdf["SOC_2021_2_NAME"]).union(set(pdf["SOC_2021_3_NAME"])))
label_map = {label: i for i, label in enumerate(labels)}

sources = [label_map[src] for src in pdf["SOC_2021_2_NAME"]]
targets = [label_map[tgt] for tgt in pdf["SOC_2021_3_NAME"]]
values = pdf["count"]

fig = go.Figure(go.Sankey(
    node=dict(
        pad=15, thickness=20, line=dict(color="black", width=0.5), label=labels
    ),
    link=dict(source=sources, target=targets, value=values)
))
fig.update_layout(title_text="Career Pathway Trends (Sankey Diagram)", font_size=12)
fig.show()
                                                                                

This Sankey diagram illustrates career pathway trends, but the current result shows that all job transitions are concentrated between "Computer and Mathematical Occupations" and "Mathematical Science Occupations." This indicates that the dataset contains only one career transition path, without other industries or more granular job movements. Since the entire flow is singular and lacks branches, the diagram appears as a simple connection rather than a complex career progression map. To make the visualization more informative, it may be necessary to check the dataset to ensure that SOC_2021_2_NAME and SOC_2021_3_NAME cover multiple different occupation categories or utilize a more detailed occupational classification level. Otherwise, while the current diagram is technically correct, it fails to effectively display a comprehensive career transition trend.ΒΆ